1 Data Cleaning and Check

This application is designed to perform comprehensive data cleaning and provide detailed feedback on the process. Although the data cleaning operations are conducted behind the scenes, this section outlines the key steps and methodologies employed.

1.1 Data Cleaning and Manipulation

1.1.1 Data Cleaning

Country Code Conversion: Country names are converted to their official three-letter ISO 3166-1 alpha-3 codes (iso3c) for consistency. A name that does not match a single country is labeled “Multi-country,” so that such entries can be identified and handled separately.
Author Name Cleanup: The authors’ names are standardized by removing numbers, changing “and” to “&”, and including the publication year, making author information clearer and more consistent.
Publication Year Standardization: Converts the publication year to a standard numeric format, simplifying any analysis that involves time.
Missing Data Handling: Converts all instances of “N/A” in text fields to a standard “NA”, ensuring uniformity in how missing information is recorded.
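
The four cleaning steps above can be illustrated with a minimal Python sketch. This is not the application's own code: the function names, the three-entry lookup table, and the "Name (Year)" author format are assumptions made for illustration only.

```python
import re

# Hypothetical lookup table; a real application would use a full ISO 3166-1 list.
ISO3C = {"kenya": "KEN", "india": "IND", "brazil": "BRA"}

def to_iso3c(country):
    """Return the three-letter code, or "Multi-country" when no match is found."""
    return ISO3C.get(country.strip().lower(), "Multi-country")

def clean_author(author, year):
    """Remove digits, replace the word "and" with "&", and add the year.

    The "Name (Year)" layout is an assumption; the source only says the
    publication year is included.
    """
    name = re.sub(r"\d+", "", author)       # drop footnote-style numbers
    name = re.sub(r"\band\b", "&", name)    # whole word only, so "Anderson" is untouched
    name = re.sub(r"\s+", " ", name).strip()
    return f"{name} ({year})"

def to_year(value):
    """Convert a publication year stored as text or as a number to an int."""
    return int(float(str(value).strip()))

def normalize_missing(value):
    """Map the text "N/A" to the standard marker "NA"; leave other values as-is."""
    if isinstance(value, str) and value.strip() == "N/A":
        return "NA"
    return value

print(to_iso3c("Kenya"))                        # KEN
print(to_iso3c("Kenya, India"))                 # Multi-country
print(clean_author("Smith1 and Jones2", 2019))  # Smith & Jones (2019)
```

Matching on the whole word "and" avoids altering surnames that merely contain those letters.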

1.1.2 Data Manipulation

Multiple Interventions Indicator: A new indicator is added to show when a study examines more than one intervention, aiding in the identification of complex studies.
Region Assignment: Each country is matched with its World Bank region. Unmatched countries are grouped under “Multi-country,” so that regional analysis can be used to check the DEP-coded data.
Income-Level Grouping: This step integrates World Bank data to add income level information for each country, enhancing the dataset with economic classifications.
COVID-19 Period Identification: A new marker is added to indicate whether a study overlaps with the COVID-19 peak outbreak (2020), highlighting research relevant to the pandemic.
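
As a rough illustration of these four derived fields, the sketch below enriches one study record. The field names (`iso3c`, `interventions`, `start_year`, `end_year`), the semicolon-separated intervention list, and the small lookup tables are all hypothetical. The source does not say how unmatched countries are treated for income level, so `None` is used here as an assumption.

```python
# Illustrative values only; World Bank classifications are revised annually.
WB_REGION = {
    "KEN": "Sub-Saharan Africa",
    "IND": "South Asia",
    "BRA": "Latin America & Caribbean",
}
WB_INCOME = {
    "KEN": "Lower middle income",
    "IND": "Lower middle income",
    "BRA": "Upper middle income",
}

def enrich(study):
    """Return a copy of the study record with the four derived fields added."""
    code = study["iso3c"]
    # Assumption: interventions are stored as one semicolon-separated string.
    interventions = [s.strip() for s in study["interventions"].split(";") if s.strip()]
    return {
        **study,
        "multiple_interventions": len(interventions) > 1,
        "region": WB_REGION.get(code, "Multi-country"),
        "income_level": WB_INCOME.get(code),  # None when unmatched (assumption)
        # True when the study period includes the year 2020.
        "covid_period": study["start_year"] <= 2020 <= study["end_year"],
    }

record = enrich({
    "iso3c": "KEN",
    "interventions": "cash transfer; training",
    "start_year": 2019,
    "end_year": 2021,
})
print(record["multiple_interventions"], record["region"], record["covid_period"])
# True Sub-Saharan Africa True
```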

1.2 Data Verification and Missing Values

1.2.1 Country Name Verification

For studies encompassing more than two countries, a unique “Multi-country” code is assigned. Such entries are excluded from geographical analysis to maintain the integrity of country-specific assessments. The ‘Data Check’ tab within the application documents this coding practice. Here, “Country” refers to the original country data, while “Coding used here” illustrates the application’s method of transforming these observations for analysis.

Figure 1: Example of Country Name Verification
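
A table with the two columns described above could be produced along the following lines. This is a sketch under stated assumptions, not the application's implementation: the function names and the lookup table are hypothetical, and only the column labels “Country” and “Coding used here” come from the ‘Data Check’ tab.

```python
def country_check_table(countries, lookup):
    """Pair each original country entry with the code used for analysis."""
    rows = []
    for original in countries:
        coded = lookup.get(original.strip().lower(), "Multi-country")
        rows.append({"Country": original, "Coding used here": coded})
    return rows

def geographic_subset(rows):
    """Drop "Multi-country" rows, which are excluded from geographical analysis."""
    return [r for r in rows if r["Coding used here"] != "Multi-country"]

# Hypothetical lookup and input values.
lookup = {"kenya": "KEN", "india": "IND"}
table = country_check_table(["Kenya", "Kenya, India, Brazil", "India"], lookup)
for row in table:
    print(row["Country"], "->", row["Coding used here"])
print(len(geographic_subset(table)))  # 2
```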

1.2.2 Analysis of Missing Values in Effect Size Data

The application employs a procedure to identify and document missing values in effect size data.
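
The source does not describe the procedure in detail. One plausible shape for such a check, shown here only as a sketch, scans the effect size columns and records which rows lack a value. The column names `effect_size` and `standard_error` are hypothetical, as is the choice to treat `None`, NaN, empty strings, and “NA” as missing.

```python
import math

def is_missing(value):
    """Treat None, NaN, empty strings, and the text "NA" as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() in {"", "NA"}

def missing_report(rows, columns):
    """Return, for each column, the indices of rows whose value is missing."""
    report = {col: [] for col in columns}
    for index, row in enumerate(rows):
        for col in columns:
            if is_missing(row.get(col)):
                report[col].append(index)
    return report

# Hypothetical effect size records.
rows = [
    {"effect_size": 0.25, "standard_error": 0.10},
    {"effect_size": "NA", "standard_error": 0.08},
    {"effect_size": 0.40, "standard_error": None},
]
print(missing_report(rows, ["effect_size", "standard_error"]))
# {'effect_size': [1], 'standard_error': [2]}
```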

1.3 Dynamic Title Update Based on Data Validation

Upon uploading an Excel file and choosing a specific sheet, the application automatically updates the title to reflect the current selection and the state of the data. This update includes:

Sheet Selection: The title dynamically incorporates the name of the selected sheet, providing a clear context for the displayed data.
Outcomes Count: The number of outcomes, derived from the selected data, is prominently displayed in the title, offering immediate insight into the volume of data under consideration.
Data Quality Indication: If the application detects any issues with country names or missing values, it adds a notice to the title. This notice directs users to the ‘Data Check’ tab for a detailed review and any necessary corrections.
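
The three title components above could be assembled as in the following sketch. The function name, its parameters, and the exact wording and layout of the title are assumptions. The source specifies only that the title contains the sheet name, the outcomes count, and a notice when issues are found.

```python
def build_title(sheet_name, n_outcomes, country_issues=False, missing_values=False):
    """Compose a title from the selected sheet, the outcomes count, and data checks."""
    title = f"{sheet_name}: {n_outcomes} outcomes"
    if country_issues or missing_values:
        # Hypothetical wording for the data quality notice.
        title += " (issues detected, see the 'Data Check' tab)"
    return title

print(build_title("Sheet1", 128))
# Sheet1: 128 outcomes
print(build_title("Sheet1", 128, missing_values=True))
# Sheet1: 128 outcomes (issues detected, see the 'Data Check' tab)
```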

1.4 Takeaways

Automated Data Cleaning: Streamlines data preparation, offering feedback while ensuring transparency in the cleaning process.
Data Standardization: Implements country code conversions, author name cleanup, publication year formatting, and uniform handling of missing values for consistency across the dataset.
Data Enrichment: Adds indicators for multiple interventions and COVID-19 relevance, and integrates geographical and economic classifications to deepen analysis capabilities.
Verification and Missing Values: Employs verification for multi-country studies and a systematic approach to identify missing values in effect size data, ensuring data integrity.
Dynamic User Feedback: Updates the title based on sheet selection and data checks, guiding users to address data quality issues effectively.